To appear in Maximum Entropy and Bayesian Methods, 

K. Hanson and R. Silver (eds), Kluwer Academic Publishers, 1995. 



CLUSTER EXPANSIONS AND ITERATIVE SCALING 
FOR MAXIMUM ENTROPY LANGUAGE MODELS 



JOHN D. LAFFERTY AND BERNHARD SUHM
School of Computer Science 
Carnegie Mellon University 
5000 Forbes Avenue 
Pittsburgh, PA 15217 USA

Abstract. The maximum entropy method has recently been successfully intro- 
duced to a variety of natural language applications. In each of these applications, 
however, the power of the maximum entropy method is achieved at the cost of 
a considerable increase in computational requirements. In this paper we present 
a technique, closely related to the classical cluster expansion from statistical me- 
chanics, for reducing the computational demands necessary to calculate conditional 
maximum entropy language models. 

1. Introduction 

In this paper we present a computational technique that can enable faster cal- 
culation of maximum entropy models. The starting point for our method is an 
algorithm [1] for constructing maximum entropy distributions that is an extension 
of the generalized iterative scaling algorithm of Darroch and Ratcliff [2,3]. The 
extended algorithm relaxes the assumption of [2,3] that the constraint functions 
sum to a constant, and results in a set of decoupled polynomial equations, one for 
each feature, that must be solved to obtain the scaling terms. For each iteration, 
the distribution must be normalized (that is, the partition function must be cal- 
culated), and the coefficients of the polynomials must be determined; these steps 
have roughly the same computational cost. 

For language modeling applications the partition function and coefficient cal- 
culations entail summing over the target vocabulary, typically on the order of 
10,000-100,000 words, and determining those features that apply to each possible 
word for each context that appears in the training data. When this calculation is 
implemented directly by carrying out the summation while hashing to determine 
features and feature weights, it can be exceedingly slow. We address this problem 
by use of a technique that we call the cluster expansion, due to its resemblance to
series expansion methods in statistical physics, that carries out both the partition 
function and coefficient calculations efficiently. Our basic idea is to avoid hashing 
and an explicit summation over the entire target vocabulary for each context by 
calculating the partition function (or coefficients) for all contexts simultaneously 
as a telescoping sum of polynomials in the feature weights. By choosing the data 
structures in the implementation appropriately, the cluster expansion can be easily 
implemented for a class of language models that includes n-gram constraints in 
addition to state constraints from an underlying automaton, or other long-distance 
constraints. 

In this paper we present a description of the basic technique as well as its 
application to the construction of a simple language model for use in a speech 
recognition system. 

2. Language Modeling 

2.1. LANGUAGE MODELS AS PRIORS FOR BAYESIAN DECODING

Language modeling attempts to identify regularities in natural language and cap- 
ture them in a statistical model. Language models are crucial ingredients in auto- 
matic speech recognition [4] and statistical machine translation [5] systems, where 
their use is naturally viewed in terms of the noisy channel model from information 
theory. In this framework an information source emits messages $X$ from a distribution
$P(X)$, which then enter a noisy channel and emerge transformed into
observables $Y$ according to a conditional probability distribution $P(Y \mid X)$. The
problem of decoding is to determine the message X having the largest posterior 
probability given the observation: 

$$\hat{X} = \arg\max_{X} P(X \mid Y) = \arg\max_{X} P(Y \mid X)\, P(X) .$$

Thus, Bayesian decoding is carried out using a prior distribution $P(X)$ on messages,
a channel model $P(Y \mid X)$, and a decoder $\arg\max_{X}$. For speech recognition
and machine translation, the prior distribution is called a language model, and
it must assign a probability to every string of symbols that can be hypothesized by 
the decoder. The most common language models used in today's speech systems 
are the n-gram models, constructed in terms of simple word frequencies. 
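
The decoding rule above can be illustrated with a minimal sketch. The candidate strings, the probabilities, and the function name `decode` are invented for illustration; a practical decoder searches a very large hypothesis space rather than enumerating it as done here.

```python
import math

def decode(candidates, log_prior, log_channel, observation):
    """Return the candidate X that maximizes log P(Y | X) + log P(X)."""
    return max(candidates,
               key=lambda x: log_channel(observation, x) + log_prior(x))

# Toy numbers, made up purely for illustration.
prior = {"recognize speech": 0.7, "wreck a nice beach": 0.3}      # P(X)
channel = {"recognize speech": 0.4, "wreck a nice beach": 0.6}    # P(Y | X) for one fixed Y

best = decode(prior.keys(),
              lambda x: math.log(prior[x]),
              lambda y, x: math.log(channel[x]),
              observation=None)
```

Working with log probabilities keeps the products of many small numbers from underflowing; the argmax is unchanged because the logarithm is monotone.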

2.2. CONDITIONAL MAXIMUM ENTROPY LANGUAGE MODELS 

In the usual application of the maximum entropy principle [6], prior information, 
typically in the form of frequencies, is represented as a set of constraints which 
collectively determine a unique maximum entropy distribution. For example, if we 
observe certain bigram word frequencies $c_{ij} = \tilde{p}(w_i w_j)$ and we constrain a language
model to agree with these observations, the maximum entropy distribution
assigns a probability $p_\lambda(W)$ to a word string $W$ according to a Gibbs distribution
of the form

$$p_\lambda(W) = \frac{1}{Z_\lambda} \exp\Big( \sum_{i,j} \lambda_{ij} f_{ij}(W) \Big)$$

where the feature $f_{ij}(W)$ counts the number of times the bigram $w_i w_j$ occurs in
the string $W$, and where the partition function $Z_\lambda$ is obtained by summing over
all possible word strings $W$.

In contrast to this use of the joint distribution, recent applications of the max- 
imum entropy method in language modeling [7,8] have employed conditional mod- 
els. Such models employ features to represent various frequencies in the training 
text, such as the bigram features just mentioned, but they use this information 
to constrain a family of conditional exponential models. Factoring a word string 
$W = w_0 w_1 \cdots w_N$ into conditional probabilities we can write

$$p(W) = p(w_0) \prod_{i=1}^{N} p(w_i \mid w_0 w_1 \cdots w_{i-1}) = p(w_0) \prod_{i=1}^{N} p(w_i \mid h_i)$$

where $h_i$ is the history at time $i$. In terms of conditional models, the constraints
are presented as

$$\sum_{h} \tilde{p}(h) \sum_{w} p(w \mid h)\, f_\alpha(h, w) = \sum_{h,w} \tilde{p}(h, w)\, f_\alpha(h, w)$$

where $h$ is a history and $\tilde{p}$ denotes the empirical distribution. The maximum entropy model subject to these constraints
is given by 



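$$p_\lambda(w \mid h) = \frac{1}{Z_\lambda(h)} \exp\Big( \sum_{\alpha} \lambda_\alpha f_\alpha(h, w) \Big) . \qquad (1)$$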




The partition function $Z_\lambda(h)$ is now obtained by summing over the target word
vocabulary, rather than over all word strings. Constraining a family of conditional 
models in this manner is typically much more manageable computationally than 
working with a single constrained joint distribution. In addition, the use of condi- 
tional models is desirable for applications which process the input in a left-to-right 
fashion. 
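
As a minimal sketch of the conditional model (1), the following computes $p_\lambda(w \mid h)$ directly, with the partition function obtained by an explicit sum over the vocabulary. Binary features are assumed, and the names `conditional_probs` and `toy_features` and the toy weights are hypothetical, chosen only for illustration.

```python
import math

def conditional_probs(history, vocab, weights, active_features):
    """Directly compute p_lambda(w | history) for every w in vocab.

    weights: dict mapping a feature id to its parameter lambda.
    active_features(history, w): ids of the binary features equal to 1 at (history, w).
    """
    scores = {
        w: math.exp(sum(weights.get(a, 0.0) for a in active_features(history, w)))
        for w in vocab
    }
    z = sum(scores.values())  # partition function Z_lambda(history): a sum over the vocabulary
    return {w: s / z for w, s in scores.items()}

# Toy model (hypothetical): unigram features ("u", w) and bigram features ("b", prev, w).
def toy_features(history, w):
    return [("u", w), ("b", history[-1], w)]

vocab = ["a", "b", "c"]
weights = {("u", "a"): 0.5, ("b", "a", "b"): 1.0}
probs = conditional_probs(["c", "a"], vocab, weights, toy_features)
```

The loop over `vocab` together with the dictionary lookups for each word is exactly the cost that becomes prohibitive when the vocabulary has tens of thousands of words and the computation is repeated for every history.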



3. Iterative Scaling 

The generalized iterative scaling algorithm of Darroch and Ratcliff [2] is one 
method for calculating the maximum entropy distribution (1). This algorithm 
assumes that the features $f_\alpha(h, w)$ are non-negative and sum to a constant,
independent of $h$ and $w$:

$$M(h, w) = \sum_{\alpha} f_\alpha(h, w) = M \quad \text{for all } h, w. \qquad (2)$$

Given these restrictions, the Darroch-Ratcliff algorithm begins with an initial
model, typically the uniform distribution obtained by setting $\lambda_\alpha = 0$. In the
iterative step, when the current model is $p_\lambda(w \mid h)$, the algorithm increments each
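parameter $\lambda_\alpha$ by an amount $\Delta\lambda_\alpha$ determined by

$$\Delta\lambda_\alpha = \frac{1}{M} \log \left( \frac{\sum_{h,w} \tilde{p}(h, w)\, f_\alpha(h, w)}{\sum_{h,w} \tilde{p}(h)\, p_\lambda(w \mid h)\, f_\alpha(h, w)} \right) .$$

Letting $\Delta\beta_\alpha = e^{\Delta\lambda_\alpha}$, we can express this update as choosing $\Delta\beta_\alpha$ to be the unique
solution of the equation

$$p_\lambda[ f_\alpha\, \Delta\beta_\alpha^{M} ] = \tilde{p}[ f_\alpha ] \qquad (3)$$

where $q[\cdot]$ denotes expectation with respect to $q$ and we use $p_\lambda$ to denote the
distribution $p_\lambda(h, w) = \tilde{p}(h)\, p_\lambda(w \mid h)$.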

While the restriction (2) on M can always be enforced by introducing a "slack 
variable," it can be inconvenient to do so for conditional maximum entropy lan- 
guage models that typically have hundreds of thousands of features. In [1] an 
algorithm was introduced that extends the Darroch-Ratcliff procedure by relaxing 
the assumption that $M(h, w)$ is a constant. The updates for the improved algorithm
are again given by equation (3), but with $M$ now interpreted as a random
variable. When (2) holds, the algorithms are identical. In general, the algorithm 
which allows M to vary is more natural and easier to implement. It also converges 
more quickly, by effectively increasing the step size taken toward the maximum 
entropy solution at each iteration. 
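
When $M$ varies, equation (3) for a single feature becomes a polynomial equation $\sum_m a_m\, \Delta\beta^m = \tilde{p}[f_\alpha]$ with non-negative coefficients $a_m$. The sketch below solves such an equation by Newton's method; the choice of Newton's method, the function name `solve_scaling`, and the numbers are assumptions made for illustration, not details taken from [1].

```python
import math

def solve_scaling(coeffs, target, tol=1e-12, max_iter=100):
    """Solve sum_m coeffs[m] * beta**m = target for beta > 0 by Newton's method.

    coeffs: dict mapping an exponent m >= 1 to a coefficient a_m >= 0
            (at least one coefficient must be positive).
    """
    beta = 1.0  # corresponds to delta lambda = 0
    for _ in range(max_iter):
        value = sum(a * beta ** m for m, a in coeffs.items())
        slope = sum(a * m * beta ** (m - 1) for m, a in coeffs.items())
        step = (value - target) / slope
        beta -= step
        if abs(step) < tol:
            break
    return beta

# Constant M = 3: the solution agrees with the closed-form Darroch-Ratcliff update.
beta_const = solve_scaling({3: 0.2}, 0.1)
closed_form = math.exp(math.log(0.1 / 0.2) / 3)

# Varying M: a genuine polynomial, beta + beta**2 = 6, whose positive root is 2.
beta_mixed = solve_scaling({1: 1.0, 2: 1.0}, 6.0)
delta_lambda = math.log(beta_mixed)
```

Because the polynomial is increasing and convex for positive $\Delta\beta$, the Newton iterates started at 1 remain positive and converge to the unique positive root.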



4. Cluster Expansions 

4.1. THE MAYER EXPANSION FOR A CLASSICAL GAS

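If the Hamiltonian for a classical $N$-particle system is given by $H = \frac{1}{2m} \sum_{i} p_i^2 + \sum_{i<j} v_{ij}$
and the system occupies a volume $V$, then the classical partition function
of the system at temperature $T$ is given by

$$Q_N(V, T) = \frac{1}{N!\, h^{3N}} \int e^{-\beta H(p, q)}\, dp\, dq$$

where $\beta = 1/kT$ and $h$ is a constant introduced to make $Q_N$ dimensionless. Computing
the integral over the momenta reduces this to

$$Q_N(V, T) = \frac{1}{N!\, \lambda^{3N}} \int_V dq\, \exp\Big( -\beta \sum_{i<j} v_{ij} \Big) = \frac{1}{N!\, \lambda^{3N}}\, Z_N(V, T)$$

where $\lambda = h / \sqrt{2\pi m k T}$ is the thermal wavelength (unrelated to the model
parameters $\lambda_\alpha$). The idea of the cluster expansion is to make a change of
variables

$$\phi_{ij} = e^{-\beta v_{ij}} - 1$$

and expand $Z_N$ as a sum of products of $\phi_{ij}$:

$$Z_N(V, T) = \int_V dq \prod_{i<j} (1 + \phi_{ij}) = \int_V dq\, \Big( 1 + \sum_{i<j} \phi_{ij} + \sum_{i<j} \sum_{k<l} \phi_{ij} \phi_{kl} + \cdots \Big) .$$
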
A convenient way to think about the integrals that need to be computed comes
from expressing the various terms as graphs. If $N = 3$, for example, the integrands
are represented as graphs as follows:

[Figure: the integrands $\phi_{12}$, $\phi_{12}\phi_{23}$, and $\phi_{12}\phi_{23}\phi_{13}$, each drawn as a graph on the vertices 1, 2, 3 with one edge per factor $\phi_{ij}$.]

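In terms of this correspondence, $Z_N = \sum_{G} S(G)$, where the sum is over all $N$-particle
graphs and $S(G)$ is the appropriate integral; for example, for the graph $G$ on the
vertices 1, 2, 3 with edges $\{1,2\}$ and $\{1,3\}$,

$$S(G) = \int_V \phi_{12}\, \phi_{13}\, dq .$$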



If a graph $G$ is disconnected, then $S(G)$ factors into a product of terms, and each
connected component is referred to as a cluster. The Mayer cluster integral $b_l$ is
given by $b_l = \frac{1}{l!} \sum_{l\text{-clusters } G_l} S(G_l)$. Thus,




Simple combinatorial arguments lead to an expression for $Z_N$ in terms of the
integrals $b_l$. While this is then carried further to obtain a series expansion for
the grand partition function, we will use only the
discrete analogues of the integrals $b_l$ for conditional models. For more details on
the statistical physics calculations we refer to [9]. 
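
The algebraic identity behind the graph expansion can be checked numerically at the level of the integrand. The sketch below takes $N = 3$ and arbitrary made-up values for the three $\phi_{ij}$, and confirms that the product of the factors $(1 + \phi_{ij})$ equals the sum, over all graphs on the three vertices, of the product of the edge terms. It illustrates only the integrand; no integration over positions is performed.

```python
from itertools import combinations
from math import prod, isclose

# Arbitrary illustrative values of phi_ij for the three pairs of particles.
phi = {(1, 2): 0.3, (2, 3): -0.2, (1, 3): 0.5}

# Left-hand side: the product over pairs of (1 + phi_ij).
product_form = prod(1.0 + v for v in phi.values())

# Right-hand side: one term per graph, i.e. per subset of the possible edges.
edges = list(phi)
graph_sum = 0.0
num_graphs = 0
for k in range(len(edges) + 1):
    for graph in combinations(edges, k):
        graph_sum += prod(phi[e] for e in graph)  # the empty graph contributes 1
        num_graphs += 1
```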



4.2. CLUSTER EXPANSIONS FOR CONDITIONAL MAXENT MODELS

The computation necessary to carry out the iterative scaling algorithm described 
in Section 3 is naturally divided into two parts. First, for a conditional maximum 
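entropy model of the form (1), it is necessary to compute the partition functions
$Z_\lambda(h)$ for each history $h$ such that $\tilde{p}(h) > 0$. Using the notation from statistical
physics, we make the change of variables $\phi_\alpha(h, w) = e^{\lambda_\alpha f_\alpha(h, w)} - 1$ so that $Z_\lambda(h)$
can be expressed as

$$Z_\lambda(h) = \sum_{w} \prod_{\alpha} \big( 1 + \phi_\alpha(h, w) \big) = \sum_{w} \Big( 1 + \sum_{\alpha} \phi_\alpha + \sum_{\alpha, \alpha'} \phi_\alpha \phi_{\alpha'} + \cdots \Big) .$$

In analogy with the classical expansion, this expresses the normalization $Z_\lambda(h)$
as a sum of cluster integrals, where $\sum_{w} \sum_{\alpha} \phi_\alpha(h, w)$ is the order-one cluster,
$\sum_{w} \sum_{\alpha, \alpha'} \phi_\alpha \phi_{\alpha'}$ (summed over unordered pairs of distinct features) is the
order-two cluster, and the highest order cluster that needs
to be computed is the order-$M$ cluster, where $M$ is the largest value of $\sum_\alpha f_\alpha(h, w)$.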

This gives an exact expression for $Z_\lambda(h)$ as a telescoping sum. The point of
using this technique, as we will explain further in the following section, is that
computation of the individual clusters can be significantly more efficient than computing
$Z_\lambda(h)$ directly. Furthermore, the computation of the clusters can be shared
across different histories. The use of Cheeseman's method [10,11] of reordering 
summations within a cluster can provide further savings. 
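
The following sketch compares the two ways of computing the partition function for a single fixed history, assuming binary features. The function names, the feature ids, and the toy weights are hypothetical. The cluster version visits only the words that have at least one active feature, since $\phi_\alpha = 0$ wherever a feature is inactive, while the order-zero cluster is simply the vocabulary size.

```python
import math
from itertools import combinations

def z_direct(vocab, weights, active):
    """Z(h) by explicit summation over the whole vocabulary."""
    return sum(math.exp(sum(weights[a] for a in active.get(w, ()))) for w in vocab)

def z_clusters(vocab_size, weights, active):
    """Z(h) as a list of clusters: entry k is the order-k cluster.

    Only words with at least one active feature are visited, because
    phi = 0 for every inactive feature.
    """
    clusters = [float(vocab_size)]            # order zero: sum over w of 1
    for feats in active.values():
        phis = [math.exp(weights[a]) - 1.0 for a in feats]
        for k in range(1, len(phis) + 1):
            while len(clusters) <= k:
                clusters.append(0.0)
            # products of phi over all unordered sets of k distinct features
            clusters[k] += sum(math.prod(c) for c in combinations(phis, k))
    return clusters

vocab = [f"w{n}" for n in range(1000)]
weights = {"f1": 0.7, "f2": -0.3, "f3": 1.2}
# Features active for each word, for one fixed history (toy data).
active = {"w3": ["f1"], "w10": ["f1", "f2"], "w42": ["f1", "f2", "f3"]}
clusters = z_clusters(len(vocab), weights, active)
```

In this toy case the direct sum touches 1000 words while the cluster sum touches three; the saving grows with the sparsity of the constraints.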

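The second computation that is necessary is the calculation of the coefficients
of $\Delta\beta_\alpha$ in the expectation $p_\lambda[f_\alpha\, \Delta\beta_\alpha^{M}]$ that appears in the scaling equation (3). In
a manner similar to that described above, we expand in terms of $\phi_\gamma$ to obtain

$$p_\lambda[ f_\alpha\, \Delta\beta_\alpha^{M} ] = \sum_{h} \frac{\tilde{p}(h)}{Z_\lambda(h)} \sum_{w} f_\alpha(h, w)\, \Delta\beta_\alpha^{M(h, w)} \Big( 1 + \sum_{\gamma} \phi_\gamma + \sum_{\gamma, \gamma'} \phi_\gamma \phi_{\gamma'} + \cdots \Big) .$$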

Here again, indirect computation of the coefficients through the calculation of the 
individual cluster terms can be significantly more efficient than direct computation. 
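
To make the notion of coefficients concrete, the sketch below collects them by direct computation for one feature, assuming binary features so that $M(h, w)$ is the number of active features. The function name `scaling_coefficients` and the toy events are hypothetical; this is the direct calculation that the cluster expansion is meant to replace, not the expansion itself.

```python
from collections import defaultdict

def scaling_coefficients(alpha, events):
    """Return {m: a_m}, the coefficient of delta_beta**m for feature alpha.

    events: iterable of (weight, active) pairs, one per (h, w), where
    weight = p~(h) * p_lambda(w | h) and active is the set of binary
    features equal to 1 at (h, w), so that M(h, w) = len(active).
    """
    coeffs = defaultdict(float)
    for weight, active in events:
        if alpha in active:
            coeffs[len(active)] += weight
    return dict(coeffs)

# Toy events (made-up probabilities).
events = [
    (0.10, {"f1"}),
    (0.05, {"f1", "f2"}),
    (0.20, {"f2"}),
    (0.15, {"f1", "f2"}),
]
coeffs = scaling_coefficients("f1", events)
```

The resulting dictionary defines the polynomial in $\Delta\beta_\alpha$ whose value must equal the empirical expectation $\tilde{p}[f_\alpha]$ in equation (3).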

The primary savings that this technique affords result from its avoidance of an
explicit summation over the entire target vocabulary for each history. In addition, 
it can make hashing for feature lookup unnecessary. While we do not generally 
obtain better theoretical computational complexity, this simple trick can result 
in substantial savings in the computation necessary for carrying out generalized 
iterative scaling. We will now give further details of these calculations for a simple 
topic-dependent bigram model developed for use in a speech recognition system. 



5. Example: A Topic-Dependent Language Model 

In this section we describe the application of the cluster expansion to the train- 
ing of a topic-dependent bigram model of the Switchboard corpus [12] for use in 
a speech recognition system. This corpus comprises approximately three million 
words of text, transcribed from more than 150 hours of speech collected from 
telephone conversations. An important aspect of the Switchboard corpus is that 
the conversations are restricted to 70 different topics. To take advantage of this 
structure, we trained a maximum entropy language model whose constraints were 
of three types. In addition to unigram and bigram constraints, we introduced 
topic-dependent unigram constraints for those words having the greatest mutual 
information with the topic. 

More precisely, the model that we constructed was specified as follows. Conditioning
on a word history $h$ which ends in a word $w'$, the probability of predicting
$w$ is given by

$$p(w \mid h) = \sum_{t} p(\text{topic} = t \mid h)\, p(w \mid h, t) = \sum_{t} p(\text{topic} = t \mid h)\, p(w \mid w', t) .$$

This model has two components: a topic prediction model $p(\text{topic} = t \mid h)$ and a
word prediction model $p(w_j \mid t, w_i)$. (The topic prediction model is not discussed
here.) The word prediction model is constructed as a conditional maximum entropy 
distribution of the form 

$$p_\lambda(w_j \mid t, w_i) = \frac{1}{Z_\lambda(i, t)} \exp\big( \lambda_{ij} + \lambda_j + \lambda_{tj} \big) .$$

We thus place constraints on the model so that it agrees with the bigram and
unigram frequencies as they appear in the data. In addition, we constrain the
topic-dependent unigrams, corresponding to the parameters $\lambda_{tj}$, for those words
$w_j$ that appear with sufficiently high mutual information with topic $t$. For example,
the topic-independent bigram constraint equations take the form

$$\sum_{t} \tilde{p}(w_i, t)\, p_\lambda(w_j \mid w_i, t) = \tilde{p}(w_i, w_j) = c_{ij}$$



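where $\tilde{p}$ is the empirical distribution, and the corresponding scaling equations
update $\lambda_{ij}$ by an amount $\Delta\lambda_{ij} = \log \Delta\beta_{ij}$, where $\Delta\beta_{ij}$ is the unique positive
solution to the equation

$$\sum_{t} \tilde{p}(w_i, t)\, p_\lambda(w_j \mid w_i, t)\, \Delta\beta_{ij}^{M(i, t, j)} = c_{ij} .$$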



The constraint and scaling equations for the parameters $\lambda_j$ and $\lambda_{tj}$ are similar.

To apply the cluster expansion technique to this model we express the partition
functions $Z_\lambda(i, t)$ in terms of the variables $\phi = e^{\lambda} - 1$ and expand
$Z_\lambda(i, t) = \sum_{j} (1 + \phi_j)(1 + \phi_{tj})(1 + \phi_{ij})$ into a sum of four cluster "integrals"

$$Z_\lambda(i, t) = b_0 + b_1(i, t) + b_2(i, t) + b_3(i, t) .$$

Using a variant of the physicists' graph notation that is appropriate for conditional 
models, we can express these terms as a sum over all configurations of a set of 
graphs; for example,

[Figure: $b_2(i, t)$ written as a sum of graphs.]

In these figures the unlabeled vertex is summed over, and an edge connecting the
vertex labeled e denotes a unigram term $\phi_j$. Thus,

$$b_3(i, t) = \sum_{j} \phi_j\, \phi_{tj}\, \phi_{ij} .$$



We use the fact that $\phi_\alpha = 0$ unless $\lambda_\alpha$ is a parameter that is being estimated.
This is what allows the above telescoping summation to be carried out efficiently;
for example, the summation $\sum_{j} \phi_j \phi_{ij}$ is carried out only over those indices $j$ for
which the bigram $(w_i, w_j)$ is constrained. The largest cluster, $b_3(i, t)$, involves a
summation over all those indices $j$ for which the bigram $(w_i, w_j)$ is constrained
and $w_j$ is a topic word for topic $t$. The cluster integrals for the various values of
$(i, t)$ with $\tilde{p}(w_i, t) > 0$ can be calculated simultaneously by a single pass through
appropriately constructed data structures, and require no expensive hashing of
the bigram parameters. A very similar analysis is applied to the task of computing
the coefficients of the iterative scaling equations for all of the parameters. When
we implemented this technique for the topic-dependent model, the resulting calculation
was more than 200 times faster than the direct implementation of the
iterative scaling algorithm. 
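
The arithmetic of the four clusters can be sketched as follows. Words, histories, and topics are represented by integers, and the sparse parameter tables are Python dictionaries; these names and the toy values are assumptions for illustration. The dictionaries stand in for the data structures used in the actual implementation, which are not described in detail here, so the sketch shows which sums are needed rather than a hashing-free layout.

```python
import math

def phi(lam):
    return math.exp(lam) - 1.0

def partition_clusters(vocab_size, uni, topic, bigram, i, t):
    """Return (b0, b1, b2, b3) for context (i, t).

    uni: {j: lambda_j}; topic: {t: {j: lambda_tj}}; bigram: {i: {j: lambda_ij}}.
    Parameters that are absent are not constrained, so their phi is 0.
    """
    pu = {j: phi(v) for j, v in uni.items()}
    pt = {j: phi(v) for j, v in topic.get(t, {}).items()}
    pb = {j: phi(v) for j, v in bigram.get(i, {}).items()}
    b0 = float(vocab_size)
    b1 = sum(pu.values()) + sum(pt.values()) + sum(pb.values())
    b2 = (sum(v * pu.get(j, 0.0) for j, v in pt.items())     # phi_j * phi_tj
          + sum(v * pu.get(j, 0.0) for j, v in pb.items())   # phi_j * phi_ij
          + sum(v * pt.get(j, 0.0) for j, v in pb.items()))  # phi_tj * phi_ij
    b3 = sum(v * pu.get(j, 0.0) * pt.get(j, 0.0) for j, v in pb.items())
    return b0, b1, b2, b3

def partition_direct(vocab_size, uni, topic, bigram, i, t):
    """Z(i, t) by explicit summation over every word j."""
    return sum(math.exp(uni.get(j, 0.0)
                        + topic.get(t, {}).get(j, 0.0)
                        + bigram.get(i, {}).get(j, 0.0))
               for j in range(vocab_size))

# Toy parameters (made up): 50 words, one topic, one constrained history word.
uni = {j: 0.01 * j for j in range(50)}
topic = {0: {3: 0.5, 7: -0.2}}
bigram = {5: {3: 1.0, 9: 0.4}}
b = partition_clusters(50, uni, topic, bigram, i=5, t=0)
z = partition_direct(50, uni, topic, bigram, i=5, t=0)
```

Several of these sums do not depend on the full context: the unigram sum is the same for every $(i, t)$, the $\phi_j \phi_{tj}$ sum depends only on $t$, and the $\phi_j \phi_{ij}$ sum depends only on $i$, so in a full implementation they could be computed once and reused.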

6. Summary 

Our use of the cluster expansion for the language model presented in Section 5 
demonstrates that this technique can be an important tool for reducing the com- 
putational burden of computing maximum entropy language models. The method 
also applies to higher order models such as "trigger models" [8], where occur- 
rences of words far back in the history can influence predictions by the use of 
long-distance bigram parameters. As a general technique, however, the method is 
limited in its usefulness. As in statistical mechanics, when the number of inter- 
acting constraints is large (i.e., when the gas is dense), the cluster expansion is of 
little use in computing the exact maximum entropy solution. For such cases the 
use of approximation techniques should be investigated. 